Stochastic Shortest Path with Energy Constraints in POMDPs
We consider partially observable Markov decision processes (POMDPs) with a
set of target states and positive integer costs associated with every
transition. The traditional optimization objective (stochastic shortest path)
asks to minimize the expected total cost until the target set is reached. We
extend the traditional framework of POMDPs to model energy consumption, which
represents a hard constraint. The energy levels may increase and decrease with
transitions, and the hard constraint requires that the energy level remain
positive in all steps until the target is reached. First, we present a novel
algorithm for solving POMDPs with energy levels, building on existing POMDP
solvers and using RTDP as its main method. Our second contribution is related
to policy representation. For larger POMDP instances the policies computed by
existing solvers are too large to be understandable. We present an automated
procedure based on machine learning techniques that extracts the
important decisions of the policy, allowing us to compute succinct,
human-readable policies. Finally, we show experimentally that our algorithm performs
well and computes succinct policies on a number of POMDP instances from the
literature that were naturally enhanced with energy levels.
Comment: Technical report accompanying a paper published in proceedings of
AAMAS 201
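The hard energy constraint described in this abstract can be illustrated on a single, already-sampled path. The sketch below is not the authors' algorithm (which builds on RTDP and POMDP solvers); it only checks the constraint along one trajectory. The function name and the `(cost, energy_delta)` encoding of a transition are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple


def total_cost_if_energy_safe(
    initial_energy: int,
    transitions: Iterable[Tuple[int, int]],
) -> Optional[int]:
    """Sum the costs along one path, or return None if the energy
    level fails to stay positive at some step.

    Each transition is a hypothetical (cost, energy_delta) pair:
    a positive integer cost and a signed change in energy.
    """
    energy = initial_energy
    total_cost = 0
    for cost, energy_delta in transitions:
        energy += energy_delta
        if energy <= 0:
            # Hard constraint violated: energy must remain positive.
            return None
        total_cost += cost
    return total_cost


# A path that keeps the energy positive (levels 2, 1, 3).
safe_path = [(2, -1), (1, -1), (4, 2)]
# A path on which the energy reaches 0 at the second step.
unsafe_path = [(1, -1), (1, -1)]

print(total_cost_if_energy_safe(3, safe_path))
print(total_cost_if_energy_safe(2, unsafe_path))
```

A planner in this setting would minimize the expected total cost only over policies whose paths all pass this check until the target is reached.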
3DGen: Triplane Latent Diffusion for Textured Mesh Generation
Latent diffusion models for image generation have crossed a quality threshold
which enabled them to achieve mass adoption. Recently, a series of works have
made advancements towards replicating this success in the 3D domain,
introducing techniques such as point cloud VAE, triplane representation, neural
implicit surfaces and differentiable rendering based training. We take another
step along this direction, combining these developments in a two-step pipeline
consisting of 1) a triplane VAE which can learn latent representations of
textured meshes and 2) a conditional diffusion model which generates the
triplane features. For the first time this architecture allows conditional and
unconditional generation of high quality textured or untextured 3D meshes
across multiple diverse categories in a few seconds on a single GPU. It
outperforms previous work substantially on image-conditioned and unconditional
generation on mesh quality as well as texture generation. Furthermore, we
demonstrate the scalability of our model to large datasets for increased
quality and diversity. We will release our code and trained models.
CLIP-Layout: Style-Consistent Indoor Scene Synthesis with Semantic Furniture Embedding
Indoor scene synthesis involves automatically picking and placing furniture
appropriately on a floor plan, so that the scene looks realistic and is
functionally plausible. Such scenes can serve as homes for immersive 3D
experiences, or be used to train embodied agents. Existing methods for this
task rely on labeled categories of furniture, e.g. bed, chair or table, to
generate contextually relevant combinations of furniture. Whether heuristic or
learned, these methods ignore instance-level visual attributes of objects, and
as a result may produce visually less coherent scenes. In this paper, we
introduce an auto-regressive scene model which can output instance-level
predictions, using a general-purpose image embedding based on CLIP. This allows
us to learn visual correspondences such as matching color and style, and
produce more functionally plausible and aesthetically pleasing scenes.
Evaluated on the 3D-FRONT dataset, our model achieves SOTA results in scene
synthesis and improves auto-completion metrics by over 50%. Moreover, our
embedding-based approach enables zero-shot text-guided scene synthesis and
editing, which easily generalizes to furniture not seen during training.
Comment: Changed paper template and cleaned up table
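One way to picture how an embedding-based approach can handle furniture not seen during training is nearest-neighbour retrieval in embedding space. The sketch below is an assumption about how such matching could work, not the paper's model: the two-dimensional vectors stand in for real CLIP embeddings, and the catalog names and the `nearest_item` helper are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_item(query: Sequence[float], catalog: Dict[str, Sequence[float]]) -> str:
    """Return the catalog key whose embedding is most similar to the query."""
    return max(catalog, key=lambda name: cosine_similarity(query, catalog[name]))


# Toy stand-ins for embeddings; real CLIP vectors have hundreds of dimensions.
catalog = {
    "oak_chair": [1.0, 0.0],
    "steel_chair": [0.0, 1.0],
}
predicted_embedding = [0.9, 0.1]

print(nearest_item(predicted_embedding, catalog))
```

Because the lookup depends only on embeddings, a new furniture item can be added to the catalog without retraining, which is the property the abstract attributes to the embedding-based approach.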
Compressing Video Calls using Synthetic Talking Heads
We leverage recent advances in talking head generation to propose an
end-to-end system for talking head video compression. Our algorithm transmits
pivot frames intermittently while the rest of the talking head video is
generated by animating them. We use a state-of-the-art face reenactment network
to detect key points in the non-pivot frames and transmit them to the receiver.
A dense flow is then calculated to warp a pivot frame to reconstruct the
non-pivot ones. Transmitting key points instead of full frames leads to
significant compression. We propose a novel algorithm to adaptively select the
best-suited pivot frames at regular intervals to provide a smooth experience.
We also propose a frame interpolator at the receiver's end to improve the
compression levels further. Finally, a face enhancement network improves
reconstruction quality, significantly improving aspects such as the
sharpness of the generated frames. We evaluate our method both qualitatively and
quantitatively on benchmark datasets and compare it with multiple compression
techniques. We release a demo video and additional information at
https://cvit.iiit.ac.in/research/projects/cvit-projects/talking-video-compression.
Comment: British Machine Vision Conference (BMVC), 202
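The saving from sending key points instead of full frames can be estimated with simple arithmetic. The sketch below assumes a fixed pivot interval (the paper selects pivot frames adaptively) and ignores the frame interpolator and any entropy coding. All byte sizes and the function name are hypothetical values chosen for illustration, not figures from the paper.

```python
def transmitted_bytes(
    num_frames: int,
    pivot_interval: int,
    pivot_frame_bytes: int,
    num_keypoints: int,
    bytes_per_keypoint: int,
) -> int:
    """Estimate the payload when every pivot_interval-th frame is sent in
    full and all other frames are sent as key points only."""
    # Pivot frames sit at indices 0, pivot_interval, 2 * pivot_interval, ...
    num_pivots = (num_frames + pivot_interval - 1) // pivot_interval
    num_non_pivots = num_frames - num_pivots
    keypoint_payload = num_keypoints * bytes_per_keypoint
    return num_pivots * pivot_frame_bytes + num_non_pivots * keypoint_payload


# Hypothetical numbers: 100 frames, a pivot every 25 frames,
# 10,000 bytes per full frame, 10 key points of 8 bytes each.
compressed = transmitted_bytes(100, 25, 10_000, 10, 8)
uncompressed = 100 * 10_000

print(compressed, uncompressed / compressed)
```

Under these assumed numbers the payload is dominated by the few pivot frames, which is why the choice of pivot frames matters for both bitrate and reconstruction quality.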